# Large-scale pretraining
- **Siglip2 Large Patch16 512** · google · Apache-2.0 · Text-to-Image · Transformers · 4,416 downloads · 8 likes
  SigLIP 2 improves on SigLIP by integrating multiple techniques to strengthen semantic understanding, localization, and dense feature extraction.
- **Wav2vec2 Large Xls R 300m Ru** · NLPVladimir · Apache-2.0 · Speech Recognition · Transformers · 56 downloads · 1 like
  A Russian automatic speech recognition (ASR) model fine-tuned from facebook/wav2vec2-xls-r-300m on the common_voice_17_0 dataset, achieving a word error rate (WER) of 0.195.
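A WER of 0.195 means roughly one reference word in five requires an edit. For readers unfamiliar with the metric, here is a minimal sketch of how WER is computed (word-level Levenshtein distance divided by reference length; the example sentences are made up):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

# One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
print(wer("the cat sat on the mat", "the cat sit on mat"))  # → 0.333…
```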
- **CLIP ViT H 14 Laion2b S32b B79k** · ModelsLab · MIT · Text-to-Image · 132 downloads · 0 likes
  A vision-language model trained with the OpenCLIP framework on the LAION-2B English subset, excelling at zero-shot image classification and cross-modal retrieval.
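Zero-shot classification in CLIP-style models reduces to comparing an image embedding against text embeddings of candidate labels ("a photo of a {label}"). A minimal sketch of that scoring step, using random unit vectors as stand-ins for real OpenCLIP embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Stand-ins for real CLIP embeddings: one image vector and one text
# vector per candidate label (e.g. cat / dog / car).
image_emb = normalize(rng.normal(size=(1, 512)))
text_embs = normalize(rng.normal(size=(3, 512)))

# Cosine similarity (dot product of unit vectors), temperature-scaled
# and softmaxed into label probabilities.
logits = 100.0 * image_emb @ text_embs.T
logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
print(probs.round(3))  # one probability per candidate label, summing to 1
```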
- **CLIP ViT B 32 Laion2b S34b B79k** · recallapp · MIT · Text-to-Image · 17 downloads · 0 likes
  A vision-language model trained with the OpenCLIP framework on the LAION-2B English dataset, supporting zero-shot image classification and cross-modal retrieval.
- **Aimv2 1b Patch14 224.apple Pt** · timm · Image Classification · Transformers · 198 downloads · 0 likes
  AIM-v2 is a 1-billion-parameter image encoder built on the timm library, suited to image feature extraction tasks.
- **Eva Giant Patch14 Clip 224.laion400m** · timm · MIT · Text-to-Image · 124 downloads · 0 likes
  An EVA CLIP vision-language model built on OpenCLIP and the timm framework, supporting zero-shot image classification.
- **Eva02 Large Patch14 Clip 224.merged2b** · timm · MIT · Image Classification · 165 downloads · 0 likes
  An EVA CLIP vision-language model built on OpenCLIP and timm model weights, supporting tasks such as zero-shot image classification.
- **Eva02 Enormous Patch14 Clip 224.laion2b Plus** · timm · MIT · Text-to-Image · 54 downloads · 0 likes
  EVA-CLIP is a large-scale vision-language model based on the CLIP architecture, supporting tasks such as zero-shot image classification.
- **Vit Large Patch14 Clip 224.dfn2b** · timm · Other · Image Classification · Transformers · 178 downloads · 0 likes
  A vision transformer based on the CLIP architecture, released by Apple and focused on image feature extraction.
- **Seamless M4t V2 Large Speech Encoder** · WueNLP · Audio Classification · Transformers · Multilingual · 67 downloads · 3 likes
  The speech encoder module extracted from SeamlessM4Tv2-Large, excelling at cross-lingual and multilingual sequence-level audio classification.
- **Vit Gigantic Patch14 Clip 224.metaclip 2pt5b** · timm · Image Classification · 444 downloads · 0 likes
  A vision model trained on the MetaCLIP-2.5B dataset, compatible with both the OpenCLIP and timm frameworks.
- **Qwen2 Audio 7B** · Qwen · Apache-2.0 · Audio-to-Text · Transformers · English · 28.26k downloads · 114 likes
  Qwen2-Audio is the Tongyi Qianwen large audio-language model series, supporting both voice-chat and audio-analysis interaction modes.
- **CLIP ViT B 32 Laion2b S34b B79k** · rroset · MIT · Text-to-Image · 48 downloads · 0 likes
  A CLIP ViT-B/32 model trained with the OpenCLIP framework on the LAION-2B dataset, supporting zero-shot image classification and cross-modal retrieval.
- **Owsm Ctc V3.1 1B** · espnet · Speech Recognition · Other · 116 downloads · 13 likes
  OWSM-CTC is an encoder-only speech foundation model based on hierarchical multi-task self-conditioned CTC, supporting multilingual speech recognition, speech translation, and language identification.
- **Chronos T5 Large** · amazon · Apache-2.0 · Time Series Forecasting · Transformers · 156.60k downloads · 139 likes
  Chronos is a family of pretrained time-series forecasting models built on language-model architectures: series are scaled and quantized into token sequences for training, enabling probabilistic forecasting.
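The scale-and-quantize idea above can be sketched in a few lines. This is an illustrative toy, not the actual Chronos tokenizer: the bin count, bin range, and mean-absolute scaling here are assumptions for the sketch.

```python
import numpy as np

def tokenize(series, n_bins=100, low=-5.0, high=5.0):
    """Scale a series by its mean absolute value, then map each value to
    one of n_bins uniform bins so it can be fed to a language model."""
    series = np.asarray(series, dtype=float)
    scale = np.abs(series).mean() or 1.0   # avoid dividing by zero
    scaled = series / scale
    edges = np.linspace(low, high, n_bins + 1)
    # np.digitize returns 1-based bin indices; shift and clip to [0, n_bins)
    tokens = np.clip(np.digitize(scaled, edges) - 1, 0, n_bins - 1)
    return tokens, scale

def detokenize(tokens, scale, n_bins=100, low=-5.0, high=5.0):
    """Approximately invert tokenization using each bin's center."""
    width = (high - low) / n_bins
    centers = low + width * (np.asarray(tokens) + 0.5)
    return centers * scale

tokens, scale = tokenize([10.0, 12.0, 9.0, 11.0])
approx = detokenize(tokens, scale)
print(tokens, approx)  # round-trip error bounded by half a bin width * scale
```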
- **Whisper Large V3 Ft Cv16 Mn** · sanchit-gandhi · Apache-2.0 · Speech Recognition · Transformers · 34 downloads · 1 like
  A speech recognition model fine-tuned from OpenAI Whisper Large V3 on the Common Voice 16.0 dataset.
- **W2v Bert 2.0** · facebook · MIT · Speech Recognition · Transformers · Multilingual · 477.05k downloads · 170 likes
  A Conformer-based speech encoder pretrained on 4.5 million hours of unlabeled audio, covering more than 143 languages.
- **Sentence Camembert Large** · Lajavaness · Apache-2.0 · Text Embedding · French · 3,729 downloads · 8 likes
  A French sentence-embedding model based on CamemBERT-large, providing strong semantic search capabilities.
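Semantic search with sentence embeddings amounts to ranking corpus vectors by cosine similarity to a query vector. A toy sketch of that ranking step, with random vectors standing in for real Sentence-CamemBERT outputs (the corpus strings are made up):

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-ins for sentence embeddings: a real model would produce one
# vector per corpus sentence and one for the query.
corpus = ["phrase A", "phrase B", "phrase C"]
corpus_embs = rng.normal(size=(3, 768))
# Construct a query vector deliberately close to "phrase B"'s embedding.
query_emb = corpus_embs[1] + 0.1 * rng.normal(size=768)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Semantic search: score every corpus sentence against the query
# and return the best match.
scores = [cosine(query_emb, e) for e in corpus_embs]
best = corpus[int(np.argmax(scores))]
print(best)  # → phrase B
```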
- **Vit H 14 CLIPA 336 Laion2b** · UCSC-VLAA · Apache-2.0 · Text-to-Image · 74 downloads · 4 likes
  A CLIPA-v2 model trained on the laion2B-en dataset, focused on zero-shot image classification.
- **Metaclip L14 Fullcc2.5b** · facebook · Text-to-Image · Transformers · 172 downloads · 3 likes
  MetaCLIP is a large-scale vision-language model trained on 2.5 billion data points from CommonCrawl (CC), revealing CLIP's data-curation methodology.
- **CLIP ViT B 32 DataComp.XL S13b B90k** · laion · MIT · Text-to-Image · 12.12k downloads · 4 likes
  A CLIP ViT-B/32 model trained on the DataComp-1B dataset, designed for tasks such as zero-shot image classification and image-text retrieval.
- **Ro Bart Large 512** · Iulian277 · Large Language Model · Transformers · Other · 141 downloads · 0 likes
  A BART large model with 400 million parameters, pretrained from scratch specifically for Romanian.
- **Pile T5 Large** · EleutherAI · Large Language Model · Transformers · English · 112 downloads · 15 likes
  Pile-T5 Large is an encoder-decoder model trained on The Pile with the T5x library, used primarily for English text-to-text generation.
- **Dinov2 Giant** · facebook · Apache-2.0 · Image Classification · Transformers · 117.56k downloads · 41 likes
  A vision Transformer trained with the DINOv2 method for self-supervised image feature extraction.
- **Idefics 9b** · HuggingFaceM4 · Other · Image-to-Text · Transformers · English · 3,676 downloads · 46 likes
  IDEFICS is an open-source multimodal model that takes image and text inputs and generates text outputs, serving as an open reproduction of DeepMind's Flamingo.
- **CLIP ViT B 32 Laion2b E16** · justram · MIT · Text-to-Image · 89 downloads · 0 likes
  A vision-language pretrained model implemented with OpenCLIP, supporting zero-shot image classification.
- **CLIP ViT L 14 CommonPool.XL.clip S13b B90k** · laion · MIT · Text-to-Image · 534 downloads · 1 like
  A vision-language model based on the CLIP architecture, supporting zero-shot image classification and cross-modal retrieval.
- **CLIP ViT L 14 CommonPool.XL S13b B90k** · laion · MIT · Text-to-Image · 4,255 downloads · 2 likes
  A vision-language pretrained model based on the CLIP architecture, supporting zero-shot image classification and cross-modal retrieval.
- **CLIP ViT B 16 CommonPool.L.basic S1b B8k** · laion · MIT · Text-to-Image · 57 downloads · 0 likes
  A vision-language model based on the CLIP architecture, supporting zero-shot image classification.
- **CLIP ViT B 32 CommonPool.M.clip S128m B4k** · laion · MIT · Image-to-Text · 164 downloads · 0 likes
  A zero-shot image classification model based on the CLIP architecture, from the CommonPool model series.
- **CLIP ViT B 32 CommonPool.S.laion S13m B4k** · laion · MIT · Text-to-Image · 58 downloads · 0 likes
  A vision-language model based on the CLIP architecture, supporting zero-shot image classification.
- **CLIP ViT B 32 CommonPool.S.image S13m B4k** · laion · MIT · Text-to-Image · 60 downloads · 0 likes
  A vision-language model based on the CLIP architecture, supporting zero-shot image classification.
- **CLIP ViT B 32 CommonPool.S.text S13m B4k** · laion · MIT · Text-to-Image · 57 downloads · 0 likes
  A vision-language model based on the CLIP architecture, supporting zero-shot image classification.
- **Arbertv2** · UBC-NLP · Large Language Model · Transformers · Arabic · 267 downloads · 6 likes
  ARBERTv2 is an upgraded BERT model trained on 243 GB of Modern Standard Arabic (MSA) text comprising 27.8 billion tokens.
- **Eva02 Large Patch14 Clip 224.merged2b S4b B131k** · timm · MIT · Image Classification · 5,696 downloads · 6 likes
  EVA02 is a large-scale vision-language model based on the CLIP architecture, supporting zero-shot image classification.
- **Mt5 Multilingual XLSum Rust** · spursyy · Text Generation · Multilingual · 18 downloads · 3 likes
  An mT5 model fine-tuned on the XL-Sum dataset across 45 languages, designed for multilingual summarization.
- **CLIP ViT B 16 Laion2b S34b B88k** · laion · MIT · Text-to-Image · 251.02k downloads · 33 likes
  A multimodal vision-language model trained with the OpenCLIP framework on the LAION-2B English dataset, supporting zero-shot image classification.
- **Maltberta** · MaCoCu · Large Language Model · Other · 26 downloads · 0 likes
  MaltBERTa is a large-scale language model pretrained on Maltese text using the RoBERTa architecture, developed within the MaCoCu project.
- **XLMR BERTovski** · MaCoCu · Large Language Model · Other · 36 downloads · 0 likes
  A language model pretrained on large-scale Bulgarian and Macedonian text, part of the MaCoCu project.
- **Model Facebookptbrlarge** · Vkt · Apache-2.0 · Speech Recognition · Transformers · 22 downloads · 0 likes
  A Brazilian Portuguese speech recognition model fine-tuned from Facebook's wav2vec2-large-xlsr-53-portuguese on the Common Voice dataset.